-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure splits are sorted by path name #200
base: master
Are you sure you want to change the base?
Ensure splits are sorted by path name #200
Conversation
@tomwhite Would it be possible for you to publish a snapshot of this? I'm not sure how to publish hadoop bam snapshots and I was only able to reproduce the original problem on travis. |
@@ -229,18 +230,21 @@ else if (split instanceof FileVirtualSplit) | |||
final List<InputSplit> origSplits = | |||
BAMInputFormat.removeIndexFiles(super.getSplits(job)); | |||
|
|||
final List<InputSplit> sortedSplits = new ArrayList<>(origSplits); | |||
sortedSplits.sort(Comparator.comparing(split -> ((FileSplit) split).getPath())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cast problematic? I ran into the problem that some splits are FileVirtualSplits
which are not a subtype of FileSplit
, there isn't any common super type that offers getPath
. Maybe FileVirtualSplit
should extend FileSplit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AnySAMInputFormat
extends Hadoop's FileInputFormat
, which returns FileSplit
objects and never FileVirtualSplit
. So the cast is safe at this point in the code.
@lbergelson Snapshots are only published from master, not for PRs, and I don't know a simple way to publish a snapshot. One way to workaround this is to use a Maven system dependency (just for testing of course). Check in the Hadoop-BAM JAR that you've built locally, then change the dependency in GATK to reference it by path; see an example here: https://github.com/broadinstitute/gatk/compare/tw_squark#diff-c197962302397baf3a4cc36463dce5ea. |
Another way is to use a snapshot built with jitpack.io - https://jitpack.io/#HadoopGenomics/Hadoop-BAM |
Untested fix for #199. @lbergelson are you able to try this out?